[SYSTEMDS-2523] Update to Spark 2.4.6 #992
The current test failure is only in the Python API, whose distribution is missing some Java files.
It did not work because the packaged jar file no longer contained all the required packages (I updated some dependencies). I therefore decided to change the Python setup so that it does not decompress the distribution zip, but instead uses the environment variable SYSTEMDS_ROOT to locate the target folder and, in turn, all the jar files. This means we don't have to include Java files in our Python distribution, and the Python execution comes closer to our [...] The downside is that we have to install SystemDS or compile it from scratch to make the Python package work, but I think that is fair.
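For illustration, the lookup described above could be sketched roughly as follows. This is not the code from this PR: the function names are hypothetical, and the assumed layout (the main jar directly in `target/`, dependency jars in `target/lib/`) is a guess about the build output.

```python
import os
from glob import glob


def find_systemds_jars(env=None):
    """Collect jar files under $SYSTEMDS_ROOT/target (hypothetical helper)."""
    env = os.environ if env is None else env
    root = env.get("SYSTEMDS_ROOT")
    if not root:
        raise RuntimeError(
            "SYSTEMDS_ROOT is not set; install SystemDS or build it from source")
    target = os.path.join(root, "target")
    # Assumed layout: main jar in target/, dependency jars in target/lib/.
    jars = sorted(glob(os.path.join(target, "*.jar")))
    jars += sorted(glob(os.path.join(target, "lib", "*.jar")))
    return jars


def build_classpath(jars):
    """Join jar paths with the platform separator (':' on Unix, ';' on Windows)."""
    return os.pathsep.join(jars)
```

With this approach the pip package ships no jars at all, so a missing or wrong SYSTEMDS_ROOT has to fail loudly rather than silently start with an incomplete classpath.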
Does this mean the Python package (pip install) no longer contains the jar files? If you use the target folder (as I already did at some point) and ship all those jars with the Python package, it becomes huge (>100MB), which I think we would like to avoid; we should keep it relatively small.
Yes, it will no longer contain any jar files, making it much smaller.
Shouldn't pip packages be independent of other resources? That would at least make installation much simpler, and I would be much in favor of a single
I agree, but the user has to set up Java anyway, and in that context I think installing and setting up SystemDS is fine.
That also sounds like a better option (in my opinion), but then we would have to define a default path, because setting a system-wide environment variable will be hard. Also, downloading the jars at install time saves no download work, since the jars have to be downloaded anyway, so why not fetch them when building the package and distribute everything combined? That would also put all the data in a single directory, where pip packages are stored. Also, setting up Java is a single command (we should actually force Java 8 execution, which we don't do atm jus
I think this is just a decision we have to make at some point. There are many good options for the install location.
Well, currently I either had to figure out which packages specifically the Python lib requires to execute, or use the packages from the SystemDS installation that was downloaded anyway. To keep it simple in development I chose the latter option. I agree that one install command through pip would be better, but I would also like us not to install, copy, or download the jars twice.
We need to move towards Java 11, but not for this release: Hadoop 3.x and Spark 3.x use/support Java 11.
I didn't go through all the comments in the cited PRs, but I'm curious to see how this Spark/Hadoop upgrade impacts performance. It might not improve anything, but it also should not make things worse. I am not sure if we have a performance suite. Usually, it is better to run a few performance tests to be sure with these kinds of upgrades.
If you have some scripts, you could test it. Unfortunately I don't have such a test suite yet, and currently I don't have access to a cluster with these versions installed; once we get one we should, as you say, test it. I would not expect huge performance increases, but the main point of upgrading is not performance but to address the open CVEs and use the versions other people are using.
True, and as you rightly said, we should move towards Java 11 and Spark 3.x. Maybe it is not much needed for this PR, but extensive performance tests will be of the highest priority for Java version upgrades. I have scripts, but those will not be sufficiently helpful for this, and the cluster is also a problem. I will work on this.
Let's not forget about Windows, but I agree that good options exist for all distros.
That was also the option I chose at first, but it seems a bit extreme that even if I choose Python for development, as I expect many of the end users will do, I have to download far more jars than I will ever make use of. Downloading twice is also a problem: if I choose to write code in both DML and Python (how big do we expect the end user overlap to be?), that would be unnecessary. I think we should either make a blacklist of packages not to download/copy, or let mvn create a collection of exactly those packages that we need. In either case it is necessary to figure out which packages are needed.
Then let's keep it as it is for now.
- Also bumps Hadoop version to 2.10
- Update to Index docs to reflect the change
- Update Netty to 4.1.47.Final because Spark uses it (for our federated)
- Update Jackson dependency, since Spark depends on it and we have other dependencies that overwrite the version
- SystemDS context now prints first message from JMLC
Python API fixes, since some dependencies changed:
- New Python API start test
- Pre setup now does not copy jar files from distribution
4682c2b to 93d47ad
Closing, because of plans to update the Spark version after the 2.0 release of SystemDS.
Minor changes to the Python API startup for ease of use: if SystemDS is installed somewhere else, the Python API will use that installation. In practice, if the SystemDS home is set, Python uses that SystemDS; if it is not set, it falls back to the jar files installed by the pip package. This is a debated topic in apache#992, where it is argued that it would be harder for users if the pip package did not contain the jar files.
- Fix dual setup of SystemDS_context
- Disable catching of component test output
Minor changes to the Python API startup for ease of use: if SystemDS is installed somewhere else, the Python API will use that installation. In practice, if the SystemDS home is set, Python uses that SystemDS; if it is not set, it falls back to the jar files installed by the pip package. This is a debated topic in #992, where it is argued that it would be harder for users if the pip package did not contain the jar files.
- Fix dual setup of systemds_context
- Added to usertest
See:
#857
#857 (comment)
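As a rough sketch of the fallback described in the commit message above (an external SystemDS if its home is set, otherwise the jars bundled with the pip install), something like the following could work. It is not the actual implementation: the function name, the use of SYSTEMDS_ROOT as the "SystemDS home" variable, and the `target` and `lib` directory names are assumptions made for illustration.

```python
import os


def resolve_jar_dir(package_dir, env=None):
    """Pick the directory to load SystemDS jars from (hypothetical helper).

    Prefers an external installation when SYSTEMDS_ROOT points at one,
    otherwise falls back to the jars shipped inside the pip package.
    """
    env = os.environ if env is None else env
    root = env.get("SYSTEMDS_ROOT")  # assumed name of the "SystemDS home" variable
    if root:
        external = os.path.join(root, "target")  # assumed build output folder
        if os.path.isdir(external):
            return external
    # Assumed location of the bundled jars inside the installed package.
    return os.path.join(package_dir, "lib")
```

A caller would typically pass the package's own directory, for example `resolve_jar_dir(os.path.dirname(os.path.abspath(__file__)))`. Checking that the external folder exists keeps a stale or mistyped variable from breaking a plain pip install.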